Using Web-Based Testing for Large-Scale Assessment

Laura S. Hamilton
Stephen P. Klein
William Lorie

RAND EDUCATION


RAND issue papers explore topics of interest to the policymaking community. Although issue papers are formally reviewed, authors have substantial latitude to express provocative views without doing full justice to other perspectives. The views and conclusions expressed in issue papers are those of the authors and do not necessarily represent those of RAND or its research sponsors.

© Copyright RAND 2000. All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from RAND.

RAND is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. RAND® is a registered trademark. RAND’s publications do not necessarily reflect the opinions or policies of its research sponsors. RAND URL: www.rand.org


August  2000 




DISTRIBUTION  STATEMENT  A 

Approved  for  Public  Release 
Distribution  Unlimited 




Building on more than 25 years of research and evaluation work, RAND Education has as its mission the improvement of educational policy and practice in formal and informal settings from early childhood on.


Acknowledgments 


This work was supported by the National Science Foundation, Division of Elementary, Secondary, and Informal Education, under grant ESI-9813981. We are grateful to Tom Glennan for suggestions that improved this paper.


Table  of  Contents 


i.  Introduction
ii.  The Context of Large-Scale Testing
    How Tests Are Used
    How Testing Is Done
iii.  A New Technology of Testing
    Adaptive Versus Linear Administration
    Item Format
    Possible Scenario for Web-Based Test Administration
    Potential Advantages over Paper-and-Pencil Testing
iv.  Issues and Concerns with Computerized Adaptive Testing
    Psychometric Issues Related to CATs
    Item Bank Development and Management
v.  Issues and Concerns with Web-Based Testing
    Infrastructure
    Human Capital
    Costs and Charges
    Reporting Results
vi.  Conclusion
References


i.  Introduction 


Efforts to improve the quality of education in the United States increasingly emphasize the need for high-stakes achievement testing. The availability of valid, reliable, and cost-effective measures of achievement is critical to the success of many reform efforts, including those that seek to motivate and reward school personnel and students. Accurate measurement of student achievement is also important for initiatives, such as vouchers and charter schools, that require publicly available information about the academic performance of schools. This paper begins with a brief discussion of the context surrounding large-scale (e.g., statewide) achievement testing. We then describe a new approach to assessment that we believe holds promise for reshaping the way achievement is measured. This approach uses tests that are delivered to students over the Internet and are tailored (“adapted”) to each student’s own level of proficiency.

We anticipate that this paper will be of interest to policymakers, educators, and test developers who are charged with improving the measurement of student achievement. We are not advocating the wholesale replacement of all current paper-and-pencil measures with web-based testing. However, we believe that current trends toward greater use of high-stakes tests and the increasing presence of technology in the classroom will lead assessment in this direction. Indeed, systems similar to those we describe in this report are already operational in several U.S. school districts and in other countries. Furthermore, we believe that although web-based testing holds promise for improving the way achievement is measured, a number of factors may limit its usefulness or potentially lead to undesirable outcomes. It is therefore imperative that the benefits and limitations of this form of testing be explored and the potential consequences be understood. The purpose of this paper is to stimulate discussion and research that will address the many issues raised by a shift toward web-based testing.

After presenting a brief background on large-scale testing, we describe the new technology of testing and illustrate it with an example. We then discuss a set of issues that need to be investigated. Our list is not exhaustive, and we do not provide answers to the many questions we raise. Instead, we hope that this discussion reveals the critical need for cross-disciplinary research to enhance the likelihood that the coming shift to an emphasis on web-based testing will truly benefit students.


ii.  The  Context  of  Large-Scale  Testing 


Testing is closely linked with the current emphasis on standards-based reform and accountability. Nearly every state in the United States has adopted academic standards in four core subjects: English, mathematics, science, and social studies. Most states assess achievement toward these standards in at least reading and mathematics at one or more grade levels. The grade levels at which tests are administered and the number of subjects tested continue to increase. Some states also are exploring the use of open-ended test items (e.g., essays) rather than relying solely on the less-expensive multiple-choice format. As a result of these trends, the cost of assessment for states has doubled from approximately $165 million in 1996 to $330 million in 2000, in constant dollars (Achieve, Inc., 1999). California alone tested over four million students in 1999 using the commercially available Stanford 9 tests.

How  Tests  Are  Used 

State and district test results are used to make decisions about schools and teachers. For example, in recent years, state policymakers have instituted high-stakes accountability systems for schools and districts by tying various rewards and sanctions (e.g., extra funds for the school or reassignment of staff) to student achievement. These accountability systems typically involve disseminating results to the public, which puts pressure on lagging schools to improve their performance. In 1999, 36 states issued school-level “report cards” to the public (Achieve, Inc., 1999).

Test scores are also used to make important decisions about individual students. Several states (including New York, California, and Massachusetts) are developing high school exit examinations that students must pass to earn diplomas. Many of the nation’s large school districts have adopted policies that tie promotion from one grade to the next to performance on district or state tests. The use of test scores for tracking, promotion, and graduation is on the rise and suggests that the need for valid and reliable data on individual student achievement will continue to grow (National Research Council, 1998).

Teachers also develop and adopt tests that they use for instructional feedback and to assign grades. In this paper, we focus on externally mandated, large-scale tests rather than these classroom assessments, though both forms of measurement have a significant impact on students.

How  Testing  Is  Done 

Many large-scale testing programs purchase their tests from commercial publishers. The three largest companies are Harcourt Educational Measurement, which publishes the Stanford Achievement Tests; CTB/McGraw-Hill, which publishes the Terra Nova; and Riverside Publishing, which publishes the Iowa Tests of Basic Skills (ITBS). In some cases, these publishers have adapted their materials to accommodate the needs of particular testing programs, such as by adding items that are aligned with a state’s content standards. Typically, however, states and districts purchase these “off-the-shelf” materials and use them as is, even when it is evident that their curricula are not aligned especially well with the test’s content.

The other main approach to large-scale testing is the use of measures developed by states or districts. For example, several states have developed tests that are designed to reflect their own content and performance standards, and others have plans to do this in the future. Some of the state-developed tests include constructed-response or performance-based items that are intended to measure important aspects of curriculum standards that are difficult to assess well with multiple-choice tests. Although there are some commercially available constructed-response tests, most of the states that use this type of item have developed their own measures, typically with the help of an outside contractor.

Despite the diversity of tests administered by states and districts, most large-scale testing programs share several common features. All state programs and most district programs currently rely heavily on paper-and-pencil exams, although a few districts also use some hands-on measures, such as in science. Almost all emphasize multiple-choice items, and many rely solely on this format. Tests are typically administered once per year, in the spring, with results generally released in early to late summer. Finally, many programs stagger subjects across grade levels; e.g., math and social studies in grades 4 and 7, reading and science in grades 5 and 8. However, a few states, such as California, test every student every year in almost every core subject, and the general trend is toward increasing the number of grade levels tested. The amount of time and resources devoted to testing is often a point of friction between those who want to measure student progress and those who are concerned about taking class time away from instruction for testing (and for preparing students to take the tests).

There are several limitations to the current approach to large-scale assessment. First, the reliance on paper-and-pencil multiple-choice tests limits the kinds of skills that can be measured. For this reason, many states and districts have experimented with other formats, such as hands-on testing, but these can be very expensive (Stecher & Klein, 1997) and do not necessarily measure the constructs that their developers intended (Hamilton, Nussbaum, & Snow, 1997). A second limitation arises from the lag time between test administration and score reporting. Answer sheets typically must be sent to an outside vendor for scoring, and the test results must then be linked with school and district data systems. This process takes even longer if items cannot be machine-scored, such as when open-response questions are used. Consequently, students, parents, and teachers generally do not receive scores from tests administered in the spring until the summer or fall, which severely limits the usefulness of the results for guiding instruction. A third problem is the distinct possibility of security breaches undermining the validity of the test when the same questions are repeated across years (Linn, Graue, & Sanders, 1990; Koretz & Barron, 1998). Security also can be a problem when the results are used to make high-stakes decisions for teachers and schools, regardless of whether questions are changed each year (Linn, 2000).

Other limitations to the typical approach to testing relate to the integration between assessment and instruction. Statewide tests are typically administered apart from the regular curriculum, so students may perceive these tests as disconnected from their everyday school experiences. Consequently, they may not be sufficiently motivated to perform their best, which in turn compromises the validity of results. In addition, separating classroom instruction from assessment leads to the perception that students spend too much time taking tests, even if the amount of time devoted to testing is minimal. This occurs in part because the time spent testing is often considered as time that is taken away from instruction. A related concern is the narrowing of curriculum that often occurs as a result of high-stakes testing. There is a tendency for teachers to focus on the topics that are tested and to neglect those that are not (Kellaghan & Madaus, 1991; Madaus, 1988; Stecher et al., 1998). Thus, although the purpose of the test may be to monitor progress of schools, the test is likely to have a significant influence on instruction if there are high stakes attached to the scores. Finally, the need to develop or adopt assessments that are aligned with state and local standards means that existing tests may not be suitable for many schools. This creates problems for generalizing results from one jurisdiction to another.

Although the traditional paper-and-pencil standardized multiple-choice test continues to be the norm, a few districts have recently experimented with computer-based testing. Advances in psychometrics and information technology are likely to accelerate the adoption of this approach, particularly when the tests are administered (or downloaded into the school) via the Internet (Klein & Hamilton, 1999). We believe that this form of testing may address many of the problems discussed above, though not all. In the next section, we describe this approach and discuss some of its advantages.


iii.  A New Technology of Testing

The role of information technology in virtually every type of educational enterprise is growing rapidly. Educational assessment is no exception. Several well-known tests are now administered via the computer, including the Graduate Record Exam (GRE), the Graduate Management Admissions Test (GMAT), and the Medical Licensing Examination. The high speed and large storage capacities of today’s computers, coupled with their rapidly shrinking costs, make computerized testing a promising alternative to traditional paper-and-pencil measures. Although computers are now used widely for large-scale, high-stakes admissions and licensing exams, their use in the K-12 market is still quite limited. In addition, most existing computerized assessments for K-12 students are administered locally on a stand-alone workstation rather than over the Internet. However, we expect this will change within a few years.

The next sections of this paper discuss some relevant technical issues. This is followed by a description of the kind of system we envision and a scenario for how it might be implemented. We then discuss the advantages of this approach. Later, we address the issues and concerns that are likely to arise as we move toward this new form of testing.

As mentioned earlier, we are primarily concerned with the type of large-scale testing that is conducted at the district and state levels, rather than with the tests that teachers use for instructional feedback. However, the effects of large-scale tests on instruction must be considered because we know that high-stakes assessment influences what happens in the classroom. Furthermore, information technology may offer opportunities to create a closer link between large-scale assessment and instruction, so it is worth considering these tests’ instructional effects as well. The computer and the Internet obviously offer promising new approaches for teacher-made tests, but this is beyond the scope of this paper.

The computerized testing approach we discuss below has three main features. First, items are administered adaptively. Second, the system makes use of several different types of questions, including selected-response (e.g., multiple-choice) and constructed-response items (e.g., short-answer or fill-in-the-blank questions). Third, the assessment is administered via the Internet rather than relying solely on stand-alone workstations.

Adaptive  Versus  Linear  Administration 

Most paper-and-pencil tests present items in a linear fashion; that is, items are administered sequentially in a predefined order, and all students are asked the same questions within a given “form” (version) of the test. Students may skip items and go back, but the order of presentation is constant. Some computerized tests are also linear. However, technology provides the opportunity to allow examinee responses to influence the difficulty of the questions the student is asked. This is known as adaptive testing. In this type of testing, the examinee responds to an item (or set of items). If the examinee does well on the item(s), then the examinee is asked more-difficult items. If the examinee does not answer the item(s) correctly, the examinee is asked easier items. This process continues until the examinee’s performance level is determined. Because information about the difficulty of each item is stored in the computer, the examinee’s “score” is affected by the difficulty of the items the examinee is able to answer correctly.
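The harder-after-correct, easier-after-incorrect rule described above can be sketched in a few lines of Python. This is only an illustration: the 1-to-10 difficulty scale, the one-step adjustment, and the function names are assumptions made here, and operational adaptive tests select items with statistical models rather than a fixed staircase.

```python
def next_difficulty(current, answered_correctly, step=1, lowest=1, highest=10):
    """Move one step harder after a correct answer, one step easier otherwise.

    The 1-10 scale and unit step are illustrative assumptions.
    """
    proposed = current + step if answered_correctly else current - step
    # Keep the difficulty inside the range covered by the (hypothetical) item pool.
    return max(lowest, min(highest, proposed))


def administer(responses, start=5):
    """Replay a list of right/wrong responses and return the difficulty path."""
    path = [start]
    for correct in responses:
        path.append(next_difficulty(path[-1], correct))
    return path


if __name__ == "__main__":
    # Two correct answers, one miss, one correct, then two misses.
    print(administer([True, True, False, True, False, False]))
```

The path of difficulties hovers around the level at which the examinee starts to miss items, which is the intuition behind using item difficulty, and not just the number correct, to determine the score.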

Computers permit this type of interactive testing to be conducted in a rapid and sophisticated manner. Because items can be scored automatically and immediately, the computer can select the next item almost instantly after an examinee responds to the previous one. Current computerized adaptive testing systems, or CATs, use item response theory (IRT) to estimate examinee proficiency and determine item selection. The length of a CAT may be specified in advance or it may be based on the examinee’s responses. With the latter type of CAT, the computer stops administering items once the examinee’s proficiency has been estimated to some prespecified degree of precision.
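A variable-length CAT of the kind just described can be sketched as follows. The specifics are assumptions chosen for illustration rather than details taken from any operational system: a one-parameter (Rasch) IRT model, proficiency estimated as a posterior mean over a coarse grid with a standard normal prior, selection of the unused item that is most informative at the current estimate, and a stopping rule of a posterior standard deviation of 0.4 or 30 items, whichever comes first.

```python
import math

# Coarse grid of candidate proficiency values (an illustrative choice).
GRID = [g / 10.0 for g in range(-40, 41)]


def p_correct(theta, b):
    """Rasch model: probability that proficiency theta answers difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information of a Rasch item at theta; largest when b is near theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


def estimate(responses):
    """Posterior mean and standard deviation of theta under a standard normal prior.

    responses is a list of (difficulty, answered_correctly) pairs.
    """
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalized prior density
        for b, correct in responses:
            p = p_correct(theta, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(bank, answer, se_target=0.4, max_items=30):
    """Administer items until the precision target or the length cap is reached.

    bank is a list of item difficulties; answer(b) returns True or False.
    """
    remaining = list(bank)
    responses = []
    theta, se = estimate(responses)
    while remaining and len(responses) < max_items and se > se_target:
        # Pick the unused item that is most informative at the current estimate.
        b = max(remaining, key=lambda d: item_information(theta, d))
        remaining.remove(b)
        responses.append((b, answer(b)))
        theta, se = estimate(responses)
    return theta, se, len(responses)


if __name__ == "__main__":
    # Hypothetical bank and a deterministic examinee who answers correctly
    # whenever the item difficulty is below 1.0.
    bank = [i / 20 for i in range(-60, 61)]
    print(run_cat(bank, lambda b: b < 1.0))
```

Setting `se_target` to zero turns this into a fixed-length test of `max_items` items, which corresponds to the other option mentioned above.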

Item  Format 

Currently, most CAT systems rely on multiple-choice items; i.e., questions in which the examinee selects one choice from among four or five alternatives. These selected-response items are commonly used in large-scale testing because the answers to them can be machine-scored, which minimizes costs. Computers also can accommodate selected-response items that vary from the standard four- or five-option multiple-choice item. For example, examinees might be asked to select a subset of choices from a list.

Many large-scale testing programs that traditionally have relied on selected-response items are now exploring the use of items that require examinees to generate their own answers; these are called constructed-response items. Computers can accommodate a wide variety of such items. For example, students may be asked to move or organize objects on the screen (e.g., on a history test, put events in the order in which they occurred). Other tests may involve students using the computer to prepare essay answers or other products. Some constructed-response items may be machine scored, particularly if the responses are brief and straightforward (e.g., a numerical response to a math problem). Researchers are exploring the use of computerized essay scoring, and it is likely that future generations of examinees will take tests that involve automatic scoring of extended responses. Currently, however, most constructed responses must be scored by trained readers.
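To make the idea of machine scoring concrete, the sketch below scores three of the item types mentioned in this section: a select-a-subset item, a short numerical response, and an ordering task. The scoring rules (all-or-nothing credit for subsets, a numeric tolerance, pairwise partial credit for ordering) and the function names are assumptions for illustration; real programs define their own rubrics.

```python
from itertools import combinations


def score_subset(selected, key):
    """Full credit only when the chosen options exactly match the key."""
    return 1 if set(selected) == set(key) else 0


def score_numeric(response, key, tolerance=1e-6):
    """Score a typed numerical answer; text that is not a number earns no credit."""
    try:
        value = float(response.strip())
    except ValueError:
        return 0
    return 1 if abs(value - key) <= tolerance else 0


def score_ordering(response, key):
    """Partial credit: share of item pairs placed in the same relative order as the key."""
    if len(key) < 2 or sorted(response) != sorted(key):
        return 0.0
    position = {event: i for i, event in enumerate(response)}
    pairs = list(combinations(key, 2))
    kept = sum(1 for earlier, later in pairs if position[earlier] < position[later])
    return kept / len(pairs)


if __name__ == "__main__":
    print(score_subset(["B", "D"], ["D", "B"]))
    print(score_numeric(" 3.50 ", 3.5))
    print(score_ordering(["b", "a", "c"], ["a", "b", "c"]))
```

Extended responses such as essays do not reduce to rules like these, which is why they are still typically scored by trained readers.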

Possible Scenario for Web-Based Test Administration

Clearly, computers offer the possibility of radically changing the way students take tests. There are many ways this could happen. We discuss below one scenario for computerized testing, involving the delivery of assessment items adaptively and the collection of student response data over the Internet. It is important to recognize that the adoption of this or other computer-based assessment systems is likely to be gradual: It will evolve over a period of time during which old and new approaches are used simultaneously. Thus, there will be efforts to ensure the comparability of results from web-based and paper-and-pencil systems. We return to this problem in a later section.

In the scenario we envision, a large set of test items is maintained on a central server. This “item bank” contains thousands of questions per subject area, covering a wide range of topics and difficulty levels. For example, a math item bank would include items covering numbers and operations, algebra, geometry, statistics, and other topics. Within each area, questions range from extremely easy to very difficult. The questions also are drawn from a wide range of grade levels.

On the day of testing, the entire item bank (or a large portion of it) for the subject being tested (e.g., science) is downloaded to a school from the central server. Students take the test in their classrooms or in the school’s computer lab. The items are administered adaptively, so each response leads to a revised estimate of the student’s proficiency and a decision either to stop testing or to administer an additional item that is harder or easier than the previous one. The student’s final score is computed almost instantly and is uploaded to a centralized data file. Scores may be given to students that same day so they know how well they did. Students complete the testing program several times a year rather than following the “spring only” approach that is typical of current statewide testing programs. Each student has a unique identifier, so students’ progress can be monitored even if they change classrooms or schools.
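The role of the unique identifier in this scenario can be illustrated with a small sketch of the centralized score file. The record layout, field names, school names, and scores below are all hypothetical; the point is only that keying records on the student, not on the classroom or school, keeps a student's history intact after a transfer.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    """One uploaded test result (hypothetical layout)."""
    student_id: str
    school: str
    test_date: str  # ISO date string, so text order equals date order
    subject: str
    scale_score: float


def progress_by_student(records, subject):
    """Group one subject's results by student, ordered by test date."""
    history = defaultdict(list)
    for r in sorted(records, key=lambda rec: rec.test_date):
        if r.subject == subject:
            history[r.student_id].append((r.test_date, r.school, r.scale_score))
    return dict(history)


def gain(student_history):
    """Change in scale score from the first to the latest administration."""
    return student_history[-1][2] - student_history[0][2]


if __name__ == "__main__":
    records = [
        ScoreRecord("S1", "Lincoln", "2000-10-02", "math", 210.0),
        ScoreRecord("S1", "Lincoln", "2001-01-15", "math", 218.0),
        # The same student after changing schools mid-year.
        ScoreRecord("S1", "Washington", "2001-05-07", "math", 225.0),
        ScoreRecord("S2", "Lincoln", "2000-10-02", "math", 190.0),
    ]
    math_history = progress_by_student(records, "math")
    print(math_history["S1"], gain(math_history["S1"]))
```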

The scores can be used to inform several decisions. Policymakers and staff in district or state offices of education may use the results to monitor achievement across schools and provide rewards or interventions. Results also may provide evidence regarding the effectiveness of various educational programs and curriculum materials.

Most large-scale assessments are used for external monitoring and accountability purposes. They are rarely if ever used for instructional feedback. Nevertheless, the greater frequency of administration and the prompt availability of results from a computer-based system may enable teachers to use the scores to assign grades, modify their instruction in response to common problems or misconceptions that arise, and provide individualized instruction that is tailored to student needs. Similarly, principals may use the results to monitor student progress across different classrooms.

Potential Advantages over Paper-and-Pencil Testing

A computerized-adaptive testing system that is Internet-based offers several advantages over paper-and-pencil multiple-choice tests. Some of these benefits arise from the use of computers, and adaptive administration in particular, whereas others derive from delivering the tests over the Internet. These benefits are discussed in turn below.

Benefits  of  Computerized  Adaptive  Testing 

One of the major advantages of CAT is decreased testing time. Because the difficulty of the questions a student is asked is tailored to that student’s proficiency level, students do not waste time on questions that are much too easy or too difficult for them. It takes many fewer items to achieve a desired level of score precision using CAT than using a standard multiple-choice test (see, e.g., Bunderson et al., 1989). This not only saves time, but it may minimize student frustration, boredom, and test anxiety. Similarly, this method reduces the likelihood of ceiling and floor effects that occur when a test is too easy or too hard for a student, thereby providing a more accurate measurement for these students than is obtained when the same set of questions is administered to all students. The use of computers may also reduce costs because the hardware can be used for other instructional purposes rather than being dedicated solely to the testing function. Whether such savings are realized depends on a number of factors that we discuss later.
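The precision argument can be illustrated numerically. The sketch below assumes a Rasch IRT model, under which an item is most informative when its difficulty matches the student's proficiency, and approximates the standard error as one over the square root of the total information. The 40-item form, the difficulty range, and the student's proficiency of 2.5 are hypothetical values chosen for illustration, and the calculation is a simplification rather than a reproduction of the cited empirical findings.

```python
import math


def rasch_information(theta, b):
    """Fisher information of one Rasch item with difficulty b at proficiency theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)


def standard_error(theta, difficulties):
    """Approximate standard error of measurement: 1 / sqrt(total information)."""
    total = sum(rasch_information(theta, b) for b in difficulties)
    return 1.0 / math.sqrt(total)


def items_needed_when_matched(se_target):
    """Items required if every item sits exactly at the student's level (0.25 information each)."""
    return math.ceil(1.0 / (0.25 * se_target ** 2))


# Hypothetical 40-item fixed form with difficulties spread evenly from -3 to +3.
fixed_form = [-3.0 + 6.0 * i / 39 for i in range(40)]
# Hypothetical tailored test of the same length for a high-scoring student.
theta = 2.5
tailored = [theta] * 40

se_fixed = standard_error(theta, fixed_form)
se_tailored = standard_error(theta, tailored)

if __name__ == "__main__":
    print(f"fixed form SE: {se_fixed:.3f}, tailored SE: {se_tailored:.3f}")
    print("matched items needed for SE 0.5:", items_needed_when_matched(0.5))
```

For a student far from the middle of the range, most items on the fixed form contribute little information, so the tailored test reaches the same precision with fewer items. The same reasoning underlies the reduction in ceiling and floor effects.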

Another potential benefit is improved test security. Because each student within a classroom takes a different test (i.e., one that is tailored to that student’s proficiency level) and because the bank from which the questions are drawn contains several thousand items, there is little risk of students being exposed to items in advance or of teachers coaching their students on specific items before or during the testing session. As software to increase test security becomes more widely available, the risk of unauthorized access to test items and results should diminish, thereby further increasing the validity of test results.

CATs are particularly useful for evaluating growth over time. Progress can be measured on a continuous scale that is not tied to grade levels. This scale enables teachers and parents to track changes in students’ proficiency during the school year and across school years, both within and across content areas. Students take different items on different occasions, so scores are generally not affected by exposure to specific items. Thus, the test can be administered several times during the year without threatening the validity of the results.¹ This offers much greater potential for the results to have a positive influence on instruction than is currently available in the typical one-time-only spring test administration schedule. CATs can also accommodate the testing of students who transferred into the school during the year, those who may have been absent on the scheduled testing date, and those with learning and other disabilities who may require additional time, large type, or other testing accommodations. Several school systems serving special populations, such as the Juvenile Court and Community Schools in Los Angeles, have adopted CAT systems to address the widely varying ability levels and extremely high mobility rates of their students.

¹ Item exposure is a topic of growing interest among psychometricians and test publishers, and its effects need to be considered when developing item banks and testing schedules. New methods have been devised to control exposure so that CATs can be administered multiple times to the same examinees, though none eliminates the risk completely (see, e.g., Revuelta & Ponsoda, 1998).

Finally, computer-based testing offers the opportunity to develop new types of questions, especially those that can assess complex problem-solving skills. For example, students can observe the effects on plant growth of various amounts of water, types of fertilizer, and exposure to sunlight in order to make inferences about the relationships among these factors. Several development efforts are currently under way to utilize technology by developing innovative tasks, including essay questions that can be machine-scored, simulations of laboratory science experiments, and other forms of constructed-response items that require students to produce, rather than just select, their answers. Many of these efforts have sought to incorporate multimedia technology to expand the range of activities in which students can engage. Bennett (1998) describes some of the possibilities offered by computer technology, such as the use of multimedia to present films and broadcasts as artifacts for analysis on a history test.

Computerized  assessments  are  especially  appropri¬ 
ate  for  evaluating  progress  in  areas  where  computers 
are  used  frequently,  such  as  writing.  Russell  and  Haney 
(1997),  for  example,  found  that  students  who  were 
accustomed to using computers in their classes performed
better on a writing test when they could use
computers  rather  than  paper  and  pencil.  Students 
using  computers  wrote  more  and  organized  their  com¬ 
positions  better.  As  instruction  comes  to  depend  more 
heavily  on  technology,  assessment  will  need  to  follow 
in  order  to  provide  valid  measurement  that  is  aligned 
with curriculum (Russell & Haney, 2000).

Benefits  of  Web  Administration 

Web-based  tests  offer  efficient  and  inexpensive  scor¬ 
ing.  Scoring  is  done  on-line,  eliminating  the  need  for 
packaging,  shipping,  and  processing  of  answer 
sheets.  Students  could  be  given  their  results  immedi¬ 
ately  after  completing  the  tests.  A  web-based  system 
would  allow  all  records  to  be  stored  automatically  at 
a  central  location,  facilitating  the  production  of  score 
summaries.  Norms  could  be  constantly  updated,  and 
analyses  of  results  could  be  done  quickly  and  efficient¬ 
ly.  Teachers  would  have  results  in  time  to  incorporate 
the  information  into  their  instruction.  Teachers  could 
also  use  results  for  assigning  grades,  so  that  students 
would  be  motivated  to  do  well  on  the  tests. 

There  are  clear  benefits  to  maintaining  the  testing 
software  and  item  banks  in  a  central  location  so  they 
can  be  downloaded  onto  school  computers  via  the  Inter¬ 
net.  Economies  of  scale  would  be  achieved  by  refreshing 
the  item  bank  from  a  central  location.  New  questions 
could  easily  be  inserted  into  existing  tests  to  gather  the 
data  on  them  for  determining  their  difficulty  and 
whether  they  would  be  appropriate  for  operational 
use.  Updating  of  software  is  done  centrally  rather 
than locally, so there is no need for expensive hardware
and software at the school site. Moreover, downloading
the bank (or a portion of it) onto a local server
and uploading the results daily avoids the delays in
computer  response  times  that  might  otherwise  arise  if 
students  were  connected  directly  to  the  main  server. 
Thus,  administering  tests  via  the  web  addresses  sev¬ 
eral  logistical  and  technical  problems  associated  with 
both  paper-and-pencil  testing  and  computer-based 
testing  on  local  workstations. 
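
As a rough illustration of how data on newly inserted questions might be used, the sketch below computes each pretest item's classical difficulty (the proportion of examinees answering correctly) and flags items that fall within a target range. The function names, the sample responses, and the 0.30 to 0.80 screening range are assumptions made for this example. The paper does not prescribe a method, and an operational program would more likely calibrate items with an item response theory model than rely on this simple statistic.

```python
def proportion_correct(responses):
    """Classical difficulty: share of examinees who answered correctly.

    `responses` is a list of 0/1 scores for one item (hypothetical data).
    """
    if not responses:
        raise ValueError("no responses recorded for this item")
    return sum(responses) / len(responses)


def screen_pretest_items(pretest_data, low=0.30, high=0.80):
    """Return {item_id: (difficulty, usable)} for each pretest item.

    The low/high bounds are illustrative assumptions, not values
    taken from the paper.
    """
    results = {}
    for item_id, responses in pretest_data.items():
        p = proportion_correct(responses)
        results[item_id] = (p, low <= p <= high)
    return results


# Hypothetical scored responses gathered while the items were
# embedded, unscored, in an operational test.
pretest_data = {
    "new_item_A": [1, 1, 0, 1, 0, 1, 1, 0],  # 5 of 8 correct
    "new_item_B": [1, 1, 1, 1, 1, 1, 1, 1],  # everyone correct
}
screened = screen_pretest_items(pretest_data)
```

In this example the second item would be set aside as too easy to be informative, while the first would be a candidate for operational use.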

iv.  Issues  and  Concerns  with 
Computerized  Adaptive  Testing 

Despite the many potential advantages of CATs, a
number  of  issues  must  be  resolved  before  a  com¬ 
puterized  testing  program  can  be  implemented  on  a 
large  scale.  This  section  discusses  several  of  these.  In 
the  next  section,  we  discuss  some  of  the  additional 
issues  associated  with  administering  CATs  over  the 
web.  We  do  not  attempt  to  resolve  these  issues. 
Instead,  this  discussion  is  intended  to  help  formulate 
a  research  agenda  that  will  support  the  future  devel¬ 
opment  of  web-based  CATs. 

Psychometric  Issues  Related  to  CATs 

The  use  of  CATs  raises  a  host  of  psychometric  con¬ 
cerns  that  have  been  the  focus  of  intense  discussion  in 
the  psychometric  community  over  the  last  decade.  Some 
of  the  major  issues  that  have  been  examined  are  sum¬ 
marized  below,  but  the  discussion  is  not  exhaustive. 

Does the Medium Matter?

Do  CATs  function  differently  from  paper-and-pencil 
tests that contain the same items? CATs reduce certain
kinds of low-frequency error associated with
paper-and-pencil  testing,  such  as  answer  sheet/item 
number  mismatches,  distractions  from  other  items  on 
the  printed  page,  and  errors  made  by  scanners.  How¬ 
ever,  CATs  may  introduce  other  kinds  of  errors.  For 
example,  a  reading  comprehension  item  that  has  the 
passage  and  questions  on  separate  screens  might 
measure  different  skills  than  the  traditional  one  with 
the  passage  and  questions  on  the  same  or  facing  pages. 
Using  multiple  screens  places  a  heavy  emphasis  on  an 
examinee’s  ability  to  recall  a  passage,  or  part  of  one, 
presented  on  a  previous  screen  (Bunderson  et  al.,  1989). 
CATs  also  may  place  heavier  demands  on  certain 
skills,  such  as  typing,  thereby  changing  the  nature  of 
what  is  tested. 

In  general,  studies  have  shown  that  computer-based 
and  paper-and-pencil  tests  tend  to  function  similarly, 
at least in the multiple-choice format (Bunderson et
al., 1989; Mead & Drasgow, 1993; Perkins, 1993;
Segall et al., 1997; Zandvliet & Farragher, 1997). The
two  methods  produce  similar  statistical  distributions 
(means,  standard  deviations,  reliabilities,  and  stan¬ 
dard  errors  of  measurement)  and  are  comparable  in 
their  predictive  validity.  However,  there  are  still  pos¬ 
sible  threats  to  comparability. 

The  dependence  of  test  scores  on  keyboarding 
speed  for  one  medium  of  test  administration  but  not 
another  is  one  potential  threat.  A  few  studies  have 
examined  relationships  between  specific  types  and 
amounts  of  experience  with  computers  and  perfor¬ 
mance  on  tests.  For  example,  Russell  (1999)  found 
that keyboarding speed was a good predictor of students’
scores on open-ended language arts and science
tests taken from the Massachusetts Comprehensive
Assessment System and the National Assessment
of Educational Progress (NAEP). However, controlling
for keyboarding speed, computer experience was
not  related  to  test  performance.  In  contrast,  scores  on 
an  open-ended  math  test  were  only  weakly  predicted 
by  keyboarding  speed,  probably  because  the  answers 
to  math  items  rely  mainly  on  less  frequently  used 
number  and  symbol  keys.  Students  with  computer 
experience  are  not  much  faster  at  identifying  and 
using  these  keys  than  are  students  without  computer 
experience. 

Some  studies  have  found  that  differences  in  mean 
scores  due  to  administration  medium  can  be  explained 
by  test  type.  Van  de  Vijver  and  Harsveld  (1994)  con¬ 
ducted  a  study  with  326  applicants  to  the  Royal 
Military  Academy  in  the  Netherlands.  One  group 
took  the  computerized  version  of  the  General 
Aptitude  Test  Battery  (GATB),  and  the  other  took  a 
paper-and-pencil  form  of  this  test.  From  this  study 
and  that  of  others,  Van  de  Vijver  and  colleagues  ten¬ 
tatively  concluded  that  cognitively  simple  clerical 
tests  are  more  susceptible  to  medium  effects  than  are 
more  complex  tasks.  They  speculated  that  these  dif¬ 
ferences  might  be  related  to  previous  computer  usage 
and  should  disappear  with  repeated  administration 
of  the  computerized  version. 

Some  medium  effects  have  been  observed  with 
open-response  tests.  As  discussed  earlier,  experimen¬ 
tal  studies  have  shown  that  students  who  are  accus¬ 
tomed  to  writing  with  computers  perform  better  on 
writing  tests  that  allow  them  to  use  computers  than 
they  do  on  standard  paper-and-pencil  writing  tests 
(Russell & Haney, 1997; Russell, 1999). The medium
can  also  affect  the  way  that  responses  are  scored. 
Powers  et  al.  (1994)  discovered  that  essay  responses 
printed from a computer are assigned lower scores
than the same essays presented in handwritten format.
Additional research is needed to examine medium
effects across the range of subject areas, grade levels,
and item formats that are likely to be included in
a large-scale testing system, and to identify implications
of these effects.

Additional  research  is  also  needed  to  examine  dif¬ 
ferences  in  medium  effects  on  tests  in  different  sub¬ 
jects  as  well  as  for  different  examinee  groups.  This 
research  is  especially  important  in  contexts  in  which 
there  is  a  need  to  compare  the  results  from  the  two 
approaches  to  testing — e.g.,  if  computerized  testing  is 
phased  in  gradually,  perhaps  starting  with  older  chil¬ 
dren  and  working  backwards  to  younger  ones  (or 
vice  versa). 

How Should an Adaptive Test Be Implemented?

As  discussed  above,  there  are  advantages  to  making 
tests  adaptive.  In  particular,  adaptive  tests  are  usual¬ 
ly  shorter  (i.e.,  fewer  items  are  needed  to  obtain  a 
given  level  of  reliability)  than  standard  tests.  This 
occurs  because  item  difficulty  is  more  closely  tailored 
to  each  examinee’s  proficiency  level.  In  a  meta-analysis 
of  20  studies,  Bergstrom  (1992)  found  that  mode  of 
administration  (adaptive  or  non-adaptive)  did  not 
affect  performance,  regardless  of  test  content  or 
examinee  age. 
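
The tailoring described above can be sketched in a few lines. The example assumes a one-parameter (Rasch) item response model, under which an item is most informative when its difficulty matches the examinee's current proficiency estimate. The paper does not commit to a particular model or selection rule, so the model choice, the function names, and the three-item bank are all assumptions made for illustration.

```python
import math


def item_information(theta, difficulty):
    """Fisher information of a Rasch item at proficiency `theta`.

    Under the Rasch model this equals p * (1 - p), where p is the
    probability of a correct response; it peaks at 0.25 when the
    item's difficulty equals theta.
    """
    p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
    return p * (1.0 - p)


def select_next_item(theta, bank, used):
    """Pick the unused item that is most informative at `theta`.

    `bank` maps item ids to difficulties; `used` is the set of ids
    already administered. Both structures are hypothetical.
    """
    candidates = [item_id for item_id in bank if item_id not in used]
    if not candidates:
        raise ValueError("item bank exhausted")
    return max(candidates, key=lambda i: item_information(theta, bank[i]))


# Hypothetical three-item bank on a logit difficulty scale.
bank = {"easy": -2.0, "medium": 0.0, "hard": 2.0}
first = select_next_item(0.1, bank, used=set())
second = select_next_item(0.1, bank, used={"medium"})
```

Because each selected item sits close to the examinee's estimated proficiency, it contributes close to the maximum possible information, which is why fewer items are needed to reach a given level of reliability.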

Still,  questions  remain  about  how  to  implement 
adaptivity.  For  example,  there  are  three  approaches  to 
determining  when  to  stop  an  adaptive  test.  Stopping 
rules  can  dictate  fixed-length,  variable-length,  or 
mixed solutions to this problem. On a fixed-length
test, the number of items to be administered (but not the
specific items) is fixed in advance of the administration.
On a variable-length test, items are administered
until an examinee’s proficiency level is estimated
to within a prespecified degree of precision. A
mixed solution to the stopping rule problem combines
some aspects of the fixed-length and variable-
length  strategies.  Researchers  have  found  that  fixed- 
length  tests  can  perform  as  well  as  variable-length  tests 
in  terms  of  the  level  of  uncertainty  about  the  final  pro¬ 
ficiency  estimate  (McBride,  Wetzel,  &  Hetter,  1997), 
but  the  decision  about  stopping  rules  needs  to  be 
informed  by  a  number  of  factors  related  to  the  con¬ 
text  of  testing. 
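
The three stopping rules can be expressed compactly in code. The sketch below again assumes a Rasch model, so that the standard error of the proficiency estimate is the reciprocal square root of the summed item information. The thresholds (a maximum of 30 items, a minimum of 10, and a target standard error of 0.30) and the particular form of the mixed rule are illustrative assumptions, not values or definitions drawn from the studies cited above.

```python
import math


def standard_error(theta, difficulties):
    """Approximate standard error of a Rasch proficiency estimate.

    Computed as 1 / sqrt(total information) over the items
    administered so far; infinite before any item is given.
    """
    info = 0.0
    for b in difficulties:
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        info += p * (1.0 - p)
    return float("inf") if info == 0 else 1.0 / math.sqrt(info)


def should_stop(n_administered, se, rule,
                max_items=30, min_items=10, target_se=0.30):
    """Decide whether to end the test under one of three rules.

    All numeric defaults are hypothetical.
    """
    if rule == "fixed":
        # Stop after a preset number of items, whatever the precision.
        return n_administered >= max_items
    if rule == "variable":
        # Stop as soon as the estimate is precise enough.
        return se <= target_se
    if rule == "mixed":
        # One possible hybrid: require a minimum length, then stop on
        # precision or on reaching the maximum length.
        if n_administered < min_items:
            return False
        return se <= target_se or n_administered >= max_items
    raise ValueError(f"unknown stopping rule: {rule}")
```

For instance, four items whose difficulty exactly matches the examinee's proficiency yield a standard error of 1.0 under this model, far above the 0.30 target, so a variable-length test would continue.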

Another  question  pertains  to  item  review,  or  the  prac¬ 
tice  of  allowing  examinees  to  change  their  answers  to 
previously  completed  questions.  This  is  often  cited  as 
a  way  to  make  CATs  more  palatable  to  examinees  who 
are  accustomed  to  paper-and-pencil  testing.  When  asked, 
examinees  often  state  a  preference  for  this  option 
(Vispoel,  Rocklin,  &  Wang,  1994).  Some  research  sug¬ 
gests  that  when  item  review  is  permitted,  examinees 
who  change  earlier  answers  improve  their  scores,  but 
only  by  a  small  amount  (Gershon  &  Bergstrom, 
1995).  Although  the  ability  to  change  answers  is 
common  on  paper-and-pencil  tests,  researchers  and 
test  developers  have  expressed  concern  that  CATs  might 
be  susceptible  to  the  use  of  item  review  to  “game”  the 
test.  For  example,  examinees  might  deliberately 
answer  early  questions  wrong  or  omit  these  questions 
so  that  subsequent  questions  are  easier,  and  then  go 
back  and  change  their  initial  wrong  answers,  which 
could  result  in  a  proficiency  estimate  that  is  artificial¬ 
ly  high  (Wainer,  1993).  Use  of  this  strategy  does 
sometimes  result  in  higher  scores,  but  its  effects 
depend on a number of factors, including the statistical
estimation method used and the examinee’s true
proficiency level (Vispoel et al., 1999). To mitigate the
risk that examinees will use item review to “game” the test,
review may be limited to small timed sections of the test (Stocking,
1996), and techniques for assessing the plausibility of
particular response patterns may be used to evaluate
results (Hulin, Drasgow, & Parsons, 1983).

Item  Bank  Development  and  Management 

Successful  implementation  of  a  CAT  system  requires 
a  sufficiently  large  bank  of  items  from  which  the  test¬ 
ing  software  can  select  questions  during  test  adminis¬ 
tration.  Some  of  the  issues  related  to  item  bank  man¬ 
agement  are  psychometric.  For  example,  how  large 
must  the  banks  be  to  accommodate  a  particular  test¬ 
ing  schedule  or  test  length?  How  many  times  can  an 
item  be  reused  before  it  must  be  removed  from  the 
bank  to  eliminate  effects  of  overexposure?  What  is 
the  optimal  distribution  of  item  difficulty  in  a  bank? 
What  is  the  most  effective  way  to  pretest  items  that 
will  be  used  to  refresh  a  bank?  These  and  related  top¬ 
ics  have  been  the  focus  of  recent  research  on  CATs. 
For  example,  Stocking  (1994)  provided  guidelines  for 
determining  the  optimal  size  of  an  item  bank  and  for 
making  one  bank  equivalent  to  another.  Most  of  the 
research  on  this  problem  has  focused  on  multiple- 
choice  questions,  which  are  scored  either  correct  or 
incorrect  and  which  can  be  completed  quickly  by 
examinees.  Use  of  constructed-response  items  raises 
additional  questions,  particularly  because  fewer  items 
can  be  administered  in  a  given  amount  of  time. 

Several  organizations  have  produced  item  banks 
appropriate for K-12 testing. The Northwest Evaluation
Association (NWEA), for example, has published
CATs in several subjects and has item banks
that are appropriate for students in elementary
through  secondary  grades.  Many  of  these  testing  sys¬ 
tems  offer  the  possibility  of  modifying  the  assessment 
so  that  it  is  aligned  with  local  or  state  content  stan¬ 
dards.  However,  it  is  not  always  clear  what  methods 
have  been  used  to  determine  this  alignment  or 
whether  curtailing  the  types  of  questions  that  can  be 
asked  would  change  their  characteristics. 

Developing  new  test  items  is  time-consuming  and 
expensive.  Although  items  may  continue  to  be  pro¬ 
duced  by  professional  test  publishers,  new  ways  of 
generating  tests  should  be  studied.  It  is  becoming 
increasingly  possible  for  computers  to  generate  items 
automatically,  subject  to  certain  constraints.  This  type 
of  item  generation,  which  can  occur  in  real  time  as 
examinees  take  the  test,  has  great  potential  for  saving 
money  and  for  reducing  test  security  problems.  How¬ 
ever,  it  may  lead  to  certain  undesirable  test  prepara¬ 
tion  strategies. 

Teachers  are  another  promising  source  of  items, 
and  including  teachers  in  the  item  generation  process 
may  significantly  enhance  their  acceptance  of  this 
type  of  large-scale  testing  program.  The  inclusion  of 
teachers  may  be  especially  valuable  in  situations  where 
a  test  is  intended  to  be  aligned  with  a  particular  local 
curriculum. 

Regardless  of  the  source  of  the  items,  several  prac¬ 
tical  concerns  arise.  These  include  copyright  laws  and 
protections,  means  of  safeguarding  data  on  item  char¬ 
acteristics,  and  arrangements  for  selling  or  leasing  item 
banks  to  schools,  educational  programs,  and  parents. 


v.  Issues  and  Concerns  with 
Web-Based  Testing 


Administering CATs via the Internet rather than on
stand-alone machines adds another layer of
complexity.  However,  some  of  the  problems  discussed 
below,  such  as  those  related  to  infrastructure,  would 
need  to  be  addressed  even  in  a  system  that  used  local¬ 
ly  administered  CATs.  We  address  them  here  because 
the  requirements  that  must  be  in  place  for  successful 
web-based  test  administration  are  often  closely  linked 
with  those  that  are  necessary  for  any  computerized 
testing.  The  section  begins  with  a  discussion  of  infra¬ 
structure,  including  equipment  and  personnel,  followed 
by  a  brief  discussion  of  costs.  Finally,  we  bring  up 
some  issues  related  to  reporting  of  assessment  results. 

Infrastructure 

The  feasibility  of  delivering  assessment  over  the  Inter¬ 
net  depends  on  both  the  material  infrastructure  and 
the  human  capital  already  in  place  in  schools.  Recent 
data  indicate  that  the  availability  of  computers  and 
the  frequency  and  quality  of  Internet  access  are  be¬ 
coming  sufficient  to  support  this  form  of  testing.  In 
1999,  95  percent  of  public  schools  had  some  form  of 
Internet  access,  and  fully  65  percent  of  instructional 
classrooms  in  the  nation’s  public  schools  were  con¬ 
nected  to  the  Internet  (U.S.  Department  of  Education, 
2000).  In  earlier  years,  schools  with  large  propor¬ 
tions  of  students  living  in  poverty  were  less  likely  to 
be  connected  than  were  wealthier  schools,  but  this 
difference  had  disappeared  by  1999.  The  percentage 
of instructional rooms that are connected, in contrast,
does vary by socioeconomic status. However,
this gap is likely to shrink due in part to the E-rate
program, which requires telecommunications companies
to provide funds for Internet access in schools
(U.S.  Department  of  Education,  1996). 

The  quality  of  Internet  connectivity  has  also  im¬ 
proved  over  the  years.  For  example,  the  percentage  of 
schools  connecting  using  a  dedicated  line  has  in¬ 
creased  significantly  over  a  recent  two-year  period. 
The  use  of  dial-up  networking  declined  from  nearly 
75  percent  of  schools  in  1996  to  14  percent  in  1999 
(U.S.  Department  of  Education,  2000).  Most  of  these 
were  replaced  with  speedier  dedicated-line  network 
connections.  However,  in  contrast  to  the  number  of 
schools  with  Internet  access,  gaps  between  poor  and 
wealthy  schools  in  the  quality  of  connection  persist. 
For  example,  72  percent  of  low-poverty  schools  (those 
with  fewer  than  11  percent  of  students  eligible  for 
free  or  reduced-price  lunches)  and  50  percent  of  high- 
poverty  schools  (those  with  71  percent  or  more  eligi¬ 
ble  students)  had  dedicated  lines  in  1999  (U.S.  Depart¬ 
ment  of  Education,  2000). 

Although  the  improvements  in  connectivity  are 
encouraging,  some  important  questions  remain.  For 
example,  what  is  the  quality  of  the  hardware  avail¬ 
able  to  students?  Are  existing  allocations  of  comput¬ 
ers  to  classrooms  sufficient  to  support  several  rounds 
of  testing  each  year?  In  addition,  the  figures  cited 
above  illustrate  that  schools  serving  large  numbers  of 
students  who  live  in  poverty  may  have  the  basic 
equipment  but  lack  the  features  that  are  necessary  for 
effective  implementation  of  a  large-scale  web-based 
assessment  system.  True  equality  of  access  is  clearly 
an  important  consideration.  More-detailed  surveys  of 
infrastructure  and  a  better  understanding  of  what 
types of schools are in need of improvements will
help determine the feasibility and fairness of implementing
the kind of system we have been describing.

There  are  additional  questions  related  to  how  com¬ 
puters  are  used.  Are  some  machines  dedicated  solely 
to  testing  or  are  they  used  for  other  instructional  activ¬ 
ities?  What  are  the  test  and  data  security  implications 
of  using  the  computers  for  multiple  purposes?  Place¬ 
ment  of  computers  throughout  the  school  building  is 
another  important  consideration.  Do  computers  need 
to  be  in  a  lab  or  can  a  web-based  assessment  system  be 
implemented  in  a  school  that  has  only  a  few  comput¬ 
ers  per  classroom?  Answers  to  these  and  related  ques¬ 
tions  have  implications  for  cost  estimates.  For  exam¬ 
ple,  testing  costs  would  be  reduced  if  the  computers 
that  were  used  for  assessment  activities  were  also 
being  used  for  other  instructional  purposes. 

Human  Capital 

Data  suggest  that  most  schools  have  staff  with  com¬ 
puter  knowledge  sufficient  to  supervise  student  com¬ 
puter  use  and  that  students  are  increasingly  familiar 
with  computers  (U.S.  Department  of  Education,  1999). 
NAEP  data  from  1996  indicate  that  approximately 
80  percent  of  fourth  and  eighth  grade  teachers 
received  some  professional  training  in  computer  use 
during  the  preceding  five  years  (Wenglinsky,  1998), 
and  it  is  likely  that  this  number  has  grown  since  then. 
However,  it  is  not  clear  what  the  quality  of  this  train¬ 
ing  was  or  how  many  of  these  staff  members  have  the 
skills  needed  to  address  the  unexpected  technical 
problems  that  will  inevitably  arise.  Research  indi¬ 
cates  that  teachers’  willingness  to  use  computers  is 
influenced by the availability of professional development
opportunities and on-site help (Becker, 1994),
so an investment in training and support will be necessary
for any large-scale instructional or assessment
effort  that  relies  on  technology.  Equity  considerations 
also  need  to  be  addressed,  since  teachers  in  more- 
affluent  school  districts  tend  to  have  easier  access  to 
professional  development  and  support. 

Students’  increasing  familiarity  with  the  use  of 
computers  increases  the  feasibility  of  implementing  a 
CAT  system  in  schools.  Data  from  1997  showed  that 
a  majority  of  students  used  computers  at  home  or 
school  for  some  purpose,  including  school  assign¬ 
ments,  word  processing,  e-mail,  Internet  browsing, 
databases,  and  graphics  and  design  (U.S.  Department 
of  Education,  1999).  Of  these  activities,  the  majority 
of  computer  use  was  for  school  assignments.  Some 
differences  in  computer  use  were  observed  across 
racial/ethnic  and  socioeconomic  groups,  with  larger 
differences  reported  for  home  computer  use  than  for 
school  use.  For  example,  in  the  elementary  grades,  84 
percent  of  white  students  and  70  percent  of  black  stu¬ 
dents  used  computers  at  school  in  1997.  The  corre¬ 
sponding  percentages  for  home  computer  use  were 
52  percent  and  19  percent,  a  much  larger  difference 
(U.S.  Department  of  Education,  1999).  The  gap  in 
school  computer  use  is  likely  to  shrink  as  schools  con¬ 
tinue  to  acquire  technological  resources.  However, 
the  difference  in  access  to  computers  at  home,  along 
with  variation  in  the  quality  of  technology  in  the 
schools  (which  is  not  currently  measured  well),  raises 
concerns  about  the  fairness  of  any  testing  system  that 
requires  extensive  use  of  computers. 

Because  schools  differ  in  the  degree  to  which  they 
are  prepared  for  large-scale  computerized  testing,  it 
may  be  necessary  to  implement  a  system  of  testing 
that supports both paper-and-pencil and computerized
adaptive testing simultaneously. Such a system is
only  feasible  if  there  is  sufficient  evidence  supporting 
the  comparability  of  these  two  forms  of  testing.  Al¬ 
though  much  of  the  research  discussed  earlier  suggests 
a  reasonable  degree  of  comparability,  the  differences 
that  have  been  observed,  particularly  on  open-response 
tests  and  those  that  require  reading  long  passages, 
suggest  that  caution  is  warranted  in  making  compar¬ 
isons  across  testing  approaches. 

Costs  and  Charges 

A  number  of  costs  are  associated  with  this  form  of 
testing,  including  development  of  item  banks,  devel¬ 
opment  of  software  with  secure  downloading  and 
uploading  capabilities,  and  acquisition  of  the  necessary 
hardware.  How  these  costs  compare  with  the  cost  of 
traditional  paper-and-pencil  testing  is  not  known, 
and  the  cost  difference  between  these  two  options  is 
likely  to  change  as  technology  becomes  less  expensive. 
A  critical  area  of  research,  therefore,  is  an  analysis  of 
the  costs  of  alternative  approaches.  This  analysis 
would  need  to  consider  the  direct  costs  of  equipment 
and  software,  as  well  as  labor  and  opportunity  costs. 
For  example,  if  web-based  testing  can  be  completed 
in  half  the  time  required  for  traditional  testing,  teach¬ 
ers  may  spend  the  extra  time  providing  additional 
instruction  to  students.  In  addition,  cost  estimates 
should  consider  the  fact  that  the  computers  and 
Internet  connections  used  for  assessment  are  likely  to 
be  used  for  other  instructional  and  administrative 
activities. 

It  will  also  be  important  to  investigate  various 
ways  of  distributing  costs  and  getting  materials  to 
schools. Should schools pay an annual fee to lease
access to the item bank or should they be charged for
each  administration  of  the  test?  The  advantages  and 
limitations  of  alternative  approaches  must  be  identi¬ 
fied  and  compared. 
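
One simple way to begin comparing the two charging schemes is a break-even calculation. The sketch below is a minimal illustration only: the function names and all dollar figures are hypothetical, and it ignores the labor, equipment, and opportunity costs discussed above.

```python
def break_even_administrations(annual_fee, fee_per_test):
    """Number of test administrations per year at which an annual
    lease and a per-administration charge cost the same."""
    if fee_per_test <= 0:
        raise ValueError("fee_per_test must be positive")
    return annual_fee / fee_per_test


def cheaper_plan(annual_fee, fee_per_test, n_administrations):
    """Name the less expensive plan for a given yearly testing volume."""
    per_test_total = fee_per_test * n_administrations
    if per_test_total < annual_fee:
        return "per-administration"
    if per_test_total > annual_fee:
        return "annual lease"
    return "equal"


# Hypothetical figures: a $6,000 annual lease versus $4 per test.
threshold = break_even_administrations(6000.0, 4.0)
small_school = cheaper_plan(6000.0, 4.0, n_administrations=1200)
large_school = cheaper_plan(6000.0, 4.0, n_administrations=2000)
```

Under these assumed prices, a school that tests several times a year, as an adaptive system would allow, could cross the break-even volume quickly, which is one reason the choice of pricing scheme interacts with the testing schedule.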

An  additional  source  of  expenses  is  the  scoring  of 
responses,  particularly  essays  and  other  constructed- 
response  questions.  Student  answers  to  open-ended 
questions  are  usually  scored  by  trained  raters.  How¬ 
ever,  the  technology  required  for  computerized  scor¬ 
ing  of  such  responses  is  developing  rapidly.  Data 
about  the  feasibility  of  machine  scoring  of  open- 
ended  responses  will  be  critical  for  determining  the 
degree  to  which  an  operational  web-based  CAT  sys¬ 
tem  can  incorporate  novel  item  types. 

Cost  analysis  of  assessment  is  complicated  by  the 
fact  that  many  of  the  costs  and  benefits  are  difficult 
to  express  in  monetary  terms,  or  even  to  quantify. 
Furthermore,  costs  and  benefits 
vary  depending  on  the  context  of 
testing  and  how  scores  are  used. 

For  example,  if  high  school  gradu¬ 
ation  is  contingent  on  passing  an 
exit  examination,  some  students 
who  would  have  been  given  diplo¬ 
mas  in  the  absence  of  the  test  will  be  retained  in 
school.  Although  the  cost  to  the  school  of  providing 
an  extra  year  of  education  to  a  student  can  be  quan¬ 
tified,  the  opportunity  costs  for  the  student  and  the 
effects  of  the  retention  on  his  or  her  self-concept  and 
motivation  to  learn  are  much  more  difficult  to 
express  in  monetary  terms.  We  also  do  not  have  suf¬ 
ficient  evidence  to  determine  whether,  and  to  what 
degree,  the  extra  year  of  high  school  will  increase 
productivity  or  result  in  increased  dropout  rates  and 
therefore increase social costs (Catterall & Winters,
1994). This is just one example of a potential outcome
of testing, and to the degree that web-based
CATs  improve  the  validity  of  this  type  of  decision 
(e.g.,  by  increasing  test  security),  a  cost-benefit  analy¬ 
sis  needs  to  reflect  this  improvement.  The  current 
assessment  literature  is  lacking  in  studies  of  costs  and 
benefits  of  different  approaches  to,  and  outcomes  of, 
large-scale  assessment. 




Reporting  Results 

As  discussed  earlier,  two  of  the  anticipated  benefits  of 
a  web-based  CAT  system  are  its  capabilities  for 
instant  scoring  and  centralized  data  storage.  These 
features  have  the  potential  for  improving  the  quality 
and  utility  of  the  information  obtained  from  testing 
and  making  results  more  timely  and  useful  for  stu¬ 
dents,  parents,  teachers,  and  policymakers.  Current 
assessment  programs  do  not  share  these  features.  Thus, 
we  have  little  experience  to  guide  decisions  about 
how  to  generate,  store,  and  distribute  results.  For 
example,  what  types  of  information  should  be  pro¬ 
duced?  Should  the  database  of  school,  district,  state, 
or  national  norms  be  continuously  updated  and  used 
to  interpret  individuals’  results?  Should  students  receive 
their  scores  immediately,  or  is  it 
better  to  have  score  reports  distrib¬ 
uted  by  teachers  some  time  after  the 
testing  date? 

In  today’s  high-stakes  testing 
environment,  the  ease  of  access  to 
results  may  have  both  positive  and 
negative  consequences  for  students 
and  teachers,  and  it  is  imperative 
that  we  anticipate  possible  misuse 
of scores. For example, although giving
principals and superintendents timely information
about the progress of students by classroom
may help them provide needed assistance and encouragement,
this information could also be used in ways
that unfairly penalize teachers or students. It may
also  increase  public  pressure  to  raise  test  scores,  cre¬ 
ating  unrealistic  demands  for  rapid  improvement. 
These  concerns  underscore  the  need  to  examine  the 
incentive  structures  and  policy  context  surrounding 
any  assessment  system,  including  the  kind  of  system 
we  have  described  here. 

Transmitting  test  questions  and  possibly  student 
responses  over  the  Internet  raises  a  host  of  addition¬ 
al  concerns.  How  will  security  of  the  tests  and  results 
be  ensured?  Who  will  have  access  to  what  data  and 
when?  Who  will  have  access  to  the  system?  For  ex¬ 
ample,  can  students  practice  for  the  tests  by  accessing 
some  or  all  of  the  item  banks  from  their  homes?  Will 
parents  be  able  to  see  how  well  their  children  are 
doing  relative  to  the  typical  performance  of  students 
in  the  same  or  other  classrooms?  The  answers  to  these 
and  related  questions  will  play  a  large  part  in  deter¬ 
mining  the  validity,  feasibility,  fairness,  and  accept¬ 
ability  of  web-based  testing. 


vi.  Conclusion 

Web-based  testing,  and  especially  the  computer¬ 
ized  adaptive  version  of  it,  will  soon  be  com¬ 
peting  with  and  possibly  replacing  the  paper-and- 
pencil  tests  that  are  now  relied  upon  for  large-scale 
K-12  assessment  programs.  This  new  type  of  testing 
will use the Internet to deliver tests to students in
their schools. Test developers and policymakers will
need  to  prepare  for  this  transition  by  embarking  on  a 
comprehensive  program  of  research  that  addresses  a 
number  of  critical  web-based  testing  issues. 

A system in which adaptive assessments are delivered to
students at their schools over the Internet has several
advantages, including decreased testing time, enhanced
security, novel item types, and rapid reporting. However,
before this system can be put in place, a number of issues
need to be considered, such as the psychometric quality of the
measures, methods for maintaining item banks, infrastructure,
human capital, costs, comparability with paper-and-pencil
measures across subject matter areas, and reporting
strategies. Research is needed in all of these and other
related areas to provide the foundation for a smooth and
effective transition to web-based testing.

Several issues and questions that were not mentioned in this
paper undoubtedly will arise concerning the implementation of
web-based testing in general and CATs in particular. However,
we anticipate that the foregoing discussion provides some of
the broad brush strokes for framing a research agenda to
investigate the validity, feasibility, and appropriateness of
this form of assessment. A program of research designed to
address these and other important related questions will
require the expertise of researchers from a wide range of
disciplines, including psychometrics, psychology, sociology,
information sciences, political science, and economics, as
well as input from educators, test developers, and
policymakers. Information technology has permeated nearly all
aspects of our lives, including education, and it is only a
matter of time before large-scale testing will utilize current
technologies, including the Internet. It is critical,
therefore, that we begin a program of research now, while
there is still time to steer this process in a direction that
will improve education and benefit students.